Repairing Regular Expressions for Extraction

Authors

Abstract

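While synthesizing and repairing regular expressions (regexes) based on Programming-by-Examples (PBE) methods have seen rapid progress in recent years, all existing works only support synthesizing or repairing regexes for membership testing, and support for extraction is still an open problem. This paper fills the void by proposing the first PBE-based method for synthesizing and repairing regexes for extraction. Our work supports regexes that have real-world extensions such as backreferences and lookarounds. The extensions significantly affect the PBE-based synthesis and repair problem. In fact, we show that there are unsolvable instances of the problem if the synthesized regexes are not allowed to use the extensions, i.e., there is no regex without the extensions that correctly classifies the given set of examples, whereas every problem instance is solvable if the extensions are allowed. This is in stark contrast to the case for membership, where every instance is guaranteed to have a solution expressible by a pure regex without the extensions. The main contribution of the paper is an algorithm to solve the PBE-based synthesis and repair problem for extraction. Our algorithm builds on existing methods for synthesizing and repairing regexes for membership testing, i.e., the enumerative search algorithms with SMT constraint solving. However, significant extensions are needed because the SMT constraints in the previous works are based on a non-deterministic semantics of regexes. Non-deterministic semantics is sound for membership but not for extraction, because which substrings are extracted depends on the deterministic behavior of actual regex engines. To address the issue, we propose a new SMT constraint generation method that respects the deterministic behavior of regex engines. For this, we first define a novel formal semantics of an actual regex engine as a deterministic big-step operational semantics, and use it as a basis to design the new SMT constraint generation method. The key idea to simulate the determinism in the formal semantics and the constraints is to consider continuations of regex matching and use them for disambiguation. We also propose two new search space pruning techniques called approximation-by-pure-regex and approximation-by-backreferences that make use of the extraction information in the examples. We have implemented the synthesis and repair algorithm in a tool called R3 (Repairing Regex for extRaction) and evaluated it on 50 regexes that contain real-world extensions. Our evaluation shows the effectiveness of the algorithm and that our new pruning techniques substantially prune the search space.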
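The gap between membership and extraction described in the abstract can be seen in any backtracking engine. The sketch below uses Python's standard `re` module as a stand-in for "actual regex engines"; it is an illustration only, not the paper's formal semantics or the R3 tool, and the helper name `extract` is hypothetical. Each pair of patterns accepts the same strings, yet the captured substrings differ because the engine resolves greediness and alternation order deterministically.

```python
import re

def extract(pattern, text):
    """Return the captured groups if `pattern` matches all of `text`, else None.

    Hypothetical helper for this illustration; not part of the paper or R3.
    """
    m = re.fullmatch(pattern, text)
    return m.groups() if m else None

# Same language (a*), but the first group is greedy in one and lazy in the other.
greedy = r"(a*)(a*)"
lazy = r"(a*?)(a*)"

# Same language ({a, ab}), but the alternatives are tried in a different order.
short_first = r"(a|ab)(b?)"
long_first = r"(ab|a)(b?)"

# Membership agrees on every input tried here...
for s in ["", "a", "aaa", "b"]:
    assert (extract(greedy, s) is None) == (extract(lazy, s) is None)

# ...but the extracted substrings do not.
print(extract(greedy, "aaa"))      # ('aaa', '')
print(extract(lazy, "aaa"))        # ('', 'aaa')
print(extract(short_first, "ab"))  # ('a', 'b')
print(extract(long_first, "ab"))   # ('ab', '')

# The extensions named in the abstract: a backreference (\1) and a lookahead (?=...).
print(extract(r"(\w+)-\1", "ab-ab"))                    # ('ab',)
print(re.search(r"\d+(?= USD)", "pay 42 USD").group())  # 42
```

A constraint encoding that only asks whether some match exists cannot tell the patterns in each pair apart, which is why an extraction-aware method has to model the engine's priority order.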

Similar resources

Repairing Data through Regular Expressions

Since regular expressions are often used to detect errors in sequences such as strings or dates, it is natural to use them for data repair. Motivated by this, we propose a data repair method based on regular expressions that makes the input sequence data obey a given regular expression at minimal revision cost. The proposed method consists of two steps: sequence repair and token value repair. For s...

Repairing Regular Expressions by Adding Missing Words

Regular expressions are used in many information extraction systems like YAGO, DBpedia, Gate and SystemT. However, they sometimes do not match what their creator wanted to find. We investigate how missing words can be added automatically to a regular expression by creating disjunctions at the appropriate positions. Our demo visualizes the steps that our algorithm employs to repair the regular e...

Efficient Submatch Extraction for Practical Regular Expressions

Stuart Haber, William Horne, Pratyusa Manadhata, Miranda Mowbray, Prasad Rao. HP Laboratories, HPL-2012-41R1. Internal posting date: March 6, 2012. Keywords: regular expressions; submatch extraction; capturing groups. A capturing group is a syntax used in modern regular expression implementations to specify a subexpression of a regul...

Explanations for Regular Expressions

Regular expressions are widely used, but they are inherently hard to understand and (re)use, which is primarily due to the lack of abstraction mechanisms that causes regular expressions to grow large very quickly. The problems with understandability and usability are further compounded by the viscosity, redundancy, and terseness of the notation. As a consequence, many different regular expressi...

Regular Expressions for Provenance

As noted by Green et al., several provenance analyses can be considered a special case of the general problem of computing formal polynomials (respectively, power series) as solutions of an algebraic system. Specific provenance is then obtained by evaluating the formal polynomial under a suitable homomorphism. Recently, we presented the idea of approximating the least solution of such algebraic s...

Journal

Journal title: Proceedings of the ACM on Programming Languages

Year: 2023

ISSN: 2475-1421

DOI: https://doi.org/10.1145/3591287